Overview

My project aims to summarize and visualize the trends in the use of animals in biomedical research by species, pain levels and state in the US the over the period from 2008 to 2019. The data is obtained from the US Department of Agriculture’s APHIS annual reports. Following this exploratory analysis, a prediction for the number of animals that will be used over the next 5 years will be made.

Introduction

The use of animals in biomedical research is ubiquitous. There are two main categories for animal use: basic biomedical sciences research and (mimicking a human disease in an animal model and studying gene expression, molecular mechanisms, etc.) and drug testing (toxicity and efficacy of an experimental drug). A growing body of studies and voices from the scientific community have pointed out the poor reliability and predictive value of animal models for human outcomes and for understanding human physiology. In 2004, the FDA estimated that 92% of drugs that pass preclinical tests in animal models fail in human clinical trials. More recent analysis suggests that, in spite of efforts to improve the predictability of animal testing, the failure rate has increased to 96%.

In 1938, Congress passed the U.S. Federal Food, Drug, and Cosmetic Act, mandating animal toxicity testing. As of October 7, 2021, Congress introduced the bipartisan FDA Modernization Act to end animal testing mandates. This comes after the European Parliament resoundingly passed a resolution on September 16, 2021, with a vote of 667 to 4 to phase out animal testing. How has animal use in research changed over the past few years in the US? What is the significance of species, state, pain levels? Using the limited USDA data, I aim to answer these questions. Furthermore, a regression model will predict the number of animals that will be needed in the next few years if the same trends continue. This analysis could be of interest to biomedical companies to reduce their time and resources spent on animal models, FDA for revision of regulatory requirements, bioethics specialists and animal advocacy groups.

The trends in animal use could be influenced by technologies that enhanced the ease of building animal models, technologies that performed better than animal models, change in bioethical standards, change in public sentiment about the topic and even opinions voiced by the heads of government regulatory bodies. I am personally interested in this analysis because one of my career goals is to work towards developing and commercializing alternatives to animal methods in biomedical research.

Cleaning and Loading Data

USDA publishes an annual report of the number of animals used by state, species and pain category. The report for every year is published in January two years later. Hence the latest data I could obtain was from 2019. The reports were published starting 2008, hence I have data for a 12 year period.

The data is in the form of several PDFs. I cleaned the data manually into .txt format. The nomenclature is year_col_X where X is the pain category. The pain categories are: - Column B: Animals held by a facility but not used in any research that year - Column C: Animals used in research; no pain involved; no pain drugs administered - Column D: Animals used in research; pain involved; pain drugs administered - Column E: Animals used in research; pain involved; no pain drugs administered.

The following code chunk loads the data as a dataframe, stacked by year and column. The loaded data has 53 rows per year per column. This includes the 50 states, District of Columbia, Puerto Rico, and “REPORT TOTAL” which is the sum of all columns (species). Note that the species column is not an exhaustive list of all animals used in biomedical research but only those covered by the Animal Welfare Act. The AWA does not cover rats, mice, birds and reptiles, which happen to be over 95% of all animals used. Columns B, C, D and E are assigned values 0, 1, 2 and 3 to reflect pain levels in a more intuitive way.

library("tidyverse")
library("dplyr")

datafr <- purrr::map_dfr(
  .x = c(2008:2019), 
  .f = function(x){
    purrr::map_dfr(
  .x = c("B", "C", "D", "E"),
  .f = function(x, y){
    dat <- read.table(file = paste0("C:/Users/Deeksha Hegde/Downloads/BMIN503_Final_Project/USDA_Data/",y,"_col_",x,".txt"), header = TRUE, sep = "\t")
    dat %>%
      mutate(across(.cols = All_Other_Covered_Species:Total, .fns =~ as.numeric(str_remove_all(string = .x, ","))), 
        Year = y,
        Column = x
            )
    }, y=x
    )
  }
)

#Checking if the data loaded is correct
datafr %>% count(Year, Column)
##    Year Column  n
## 1  2008      B 53
## 2  2008      C 53
## 3  2008      D 53
## 4  2008      E 53
## 5  2009      B 53
## 6  2009      C 53
## 7  2009      D 53
## 8  2009      E 53
## 9  2010      B 53
## 10 2010      C 53
## 11 2010      D 53
## 12 2010      E 53
## 13 2011      B 53
## 14 2011      C 53
## 15 2011      D 53
## 16 2011      E 53
## 17 2012      B 53
## 18 2012      C 53
## 19 2012      D 53
## 20 2012      E 53
## 21 2013      B 53
## 22 2013      C 53
## 23 2013      D 53
## 24 2013      E 53
## 25 2014      B 53
## 26 2014      C 53
## 27 2014      D 53
## 28 2014      E 53
## 29 2015      B 53
## 30 2015      C 53
## 31 2015      D 53
## 32 2015      E 53
## 33 2016      B 53
## 34 2016      C 53
## 35 2016      D 53
## 36 2016      E 53
## 37 2017      B 53
## 38 2017      C 53
## 39 2017      D 53
## 40 2017      E 53
## 41 2018      B 53
## 42 2018      C 53
## 43 2018      D 53
## 44 2018      E 53
## 45 2019      B 53
## 46 2019      C 53
## 47 2019      D 53
## 48 2019      E 53
datafr <- mutate(datafr, pain.level = factor(Column, levels = c("B", "C", "D", "E"), labels = c(0, 1, 2, 3)))

Data Visualization and Exploratory Analysis

Now let us visualize the numbers! The first plot shows the total animals (all categories combined) by year. The second plot shows the same but with the breakup by column.

data.total <- filter(datafr, State == "REPORT TOTAL")

#Total by year
data.total1 <- summarise(group_by(data.total, Year), Total = sum(Total))

ggplot(data = data.total1, aes(x = Year, y = Total)) + geom_line() + scale_x_continuous(breaks = c(2008:2019)) + scale_y_continuous(breaks = seq(900000, 1300000, by = 100000), labels = seq(9, 13, 1)) + xlab(label = "Year") + ylab(label = "Total in hundred thousands") + ggtitle("Total number of animals vs. Year") +
theme(
        plot.title = element_text(size = 10, face = "bold", hjust = 0.5),
        axis.title = element_text(size = 10),
        axis.text = element_text(size = 10),
        legend.position = "bottom",
        legend.direction = "horizontal",
        legend.title = element_text(size = 10),
        legend.text = element_text(size = 10)
      )

#Total by year and column
total.plot <- ggplot(data = data.total, aes(x = Year, y = Total, color = pain.level)) + geom_line() + scale_x_continuous(breaks = c(2008:2019)) + scale_y_continuous(breaks = seq(100000, 700000, by = 100000), labels = seq(1, 7, 1)) + xlab(label = "Year") + ylab(label = "Total in hundred thousands") + ggtitle("Total number of animals vs. Year by Pain Levels") +
theme(
        plot.title = element_text(size = 10, face = "bold", hjust = 0.5),
        axis.title = element_text(size = 10),
        axis.text = element_text(size = 10),
        legend.position = "bottom",
        legend.direction = "horizontal",
        legend.title = element_text(size = 10),
        legend.text = element_text(size = 10)
      )
total.plot

We see that there is a huge rise in the total from 2009 to 2010, and a steady decline since then. Overall, there is a significant decline over the 12 year period.

From the second plot, we see that pain category 1, which involves no pain and no pain drugs, has the highest number throughout.

#Total by column by state
data.by.state <- datafr %>%
  filter(State != "REPORT TOTAL") %>%
  group_by(State, pain.level) %>%
  summarise(Total = sum(Total))

data.by.state.total <- group_by(data.by.state, State) %>%
  summarize(Total.all = sum(Total))
ggplot(data.by.state.total, aes(x = State, y = Total.all)) + geom_point(size = 4) +
theme(
        plot.title = element_text(size = 22, face = "bold", hjust = 0.5),
        axis.title = element_text(size = 10),
        axis.text.x = element_text(size = 12, angle = 90, vjust = 0.5, hjust = 0.5),
        axis.text.y = element_text(size = 10),
        legend.position = "bottom",
        legend.direction = "horizontal",
        legend.title = element_text(size = 10),
        legend.text = element_text(size = 10)
      )

ggplot(data.by.state, aes(x = State, y = Total, color = pain.level)) + geom_point(size = 4) +
theme(
        plot.title = element_text(size = 22, face = "bold", hjust = 0.5),
        axis.title = element_text(size = 10),
        axis.text.x = element_text(size = 12, angle = 90, vjust = 0.5, hjust = 0.5),
        axis.text.y = element_text(size = 10),
        legend.position = "bottom",
        legend.direction = "horizontal",
        legend.title = element_text(size = 10),
        legend.text = element_text(size = 10)
      )

From the graphs, we get the following information:

States using the highest number of animals overall across all pain levels: CA, MA, NJ

States using the highest number of animals by pain level: - 0 CA, MD, TX - 1 CA, MA, OH - 2 CA, TX, MA - 3 MO, MI, IA

An arguably better representation of the same data is the use of interactive maps.

library("sf")
## Linking to GEOS 3.9.1, GDAL 3.2.1, PROJ 7.2.1
library("spData")
## Warning: package 'spData' was built under R version 4.1.2
## To access larger datasets in this package, install the spDataLarge
## package with: `install.packages('spDataLarge',
## repos='https://nowosad.github.io/drat/', type='source')`
usa <- st_read("C:/Users/Deeksha Hegde/Downloads/us-state-boundaries/us-state-boundaries.shp") %>%
  rename(State = stusab)
## Reading layer `us-state-boundaries' from data source 
##   `C:\Users\Deeksha Hegde\Downloads\us-state-boundaries\us-state-boundaries.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 56 features and 20 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -179.2311 ymin: -14.60181 xmax: 179.8597 ymax: 71.44069
## Geodetic CRS:  WGS 84
head(usa)
## Simple feature collection with 6 features and 20 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -124.8485 ymin: 14.03656 xmax: 146.1544 ymax: 49.00241
## Geodetic CRS:  WGS 84
##   gid    arealand division intptlat
## 1  16   278176477        0 18.21765
## 2  23   472276664        0 14.93678
## 3  31  1627312771        7 34.89553
## 4  35  2136109036        5 38.64729
## 5  40 -1616974352        1 41.59742
## 6  54   312831514        9 47.40732
##                                           name objectid   areawater   intptlon
## 1                                  Puerto Rico       50   628200285  -66.41080
## 2 Commonwealth of the Northern Mariana Islands       36   349301029  145.60102
## 3                                     Arkansas       44 -1334552525  -92.44463
## 4                                West Virginia        1   489848791  -80.61833
## 5                                 Rhode Island        6  1323457457  -71.52727
## 6                                   Washington       20  -324557627 -120.57580
##           oid funcstat    centlon State state  statens  centlat
## 1   303146031        A  -66.41476    PR    72 01779808 18.21647
## 2 -1625647860        A  145.59687    MP    69 01779809 16.79744
## 3   266078934        A  -92.44270    AR    05 00068085 34.89402
## 4 -1929409300        A  -80.62346    WV    54 01779805 38.64119
## 5 -1861167639        A  -71.52472    RI    44 01219835 41.59403
## 6 -1859906639        A -120.59557    WA    53 01779804 47.41490
##                                       basename mtfcc region lsadc geoid
## 1                                  Puerto Rico G4000      9    00    72
## 2 Commonwealth of the Northern Mariana Islands G4000      9    00    69
## 3                                     Arkansas G4000      3    00    05
## 4                                West Virginia G4000      3    00    54
## 5                                 Rhode Island G4000      1    00    44
## 6                                   Washington G4000      4    00    53
##                         geometry
## 1 MULTIPOLYGON (((-67.20794 1...
## 2 MULTIPOLYGON (((145.5726 15...
## 3 MULTIPOLYGON (((-94.55218 3...
## 4 MULTIPOLYGON (((-81.74725 3...
## 5 MULTIPOLYGON (((-71.7897 41...
## 6 MULTIPOLYGON (((-123.2479 4...
to_map <- inner_join(usa, data.by.state.total, by = "State")

library(mapview)
library(leaflet)
pal_fun <- colorNumeric("BuPu", NULL)
pu_message <- paste0(to_map$State,  
                     "<br>Number of animals used: ",       
                     prettyNum(to_map$Total.all, big.mark = ","))

leaflet(to_map) %>%
  addPolygons(stroke = FALSE,                        # remove polygon borders
              fillColor = ~pal_fun(Total.all),
              fillOpacity = 0.8, smoothFactor = 0.5, 
              popup = pu_message) %>%
  addProviderTiles(providers$CartoDB.Positron) %>%
  addLegend("bottomright",                           
            pal=pal_fun,                             
            values=~Total.all,                 
            title = 'Total number of animals used',                  
            opacity = 1) %>%                         
  addScaleBar()
library("sf")
library("spData")

usa <- st_read("C:/Users/Deeksha Hegde/Downloads/us-state-boundaries/us-state-boundaries.shp") %>%
  rename(State = stusab)
## Reading layer `us-state-boundaries' from data source 
##   `C:\Users\Deeksha Hegde\Downloads\us-state-boundaries\us-state-boundaries.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 56 features and 20 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -179.2311 ymin: -14.60181 xmax: 179.8597 ymax: 71.44069
## Geodetic CRS:  WGS 84
head(usa)
## Simple feature collection with 6 features and 20 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -124.8485 ymin: 14.03656 xmax: 146.1544 ymax: 49.00241
## Geodetic CRS:  WGS 84
##   gid    arealand division intptlat
## 1  16   278176477        0 18.21765
## 2  23   472276664        0 14.93678
## 3  31  1627312771        7 34.89553
## 4  35  2136109036        5 38.64729
## 5  40 -1616974352        1 41.59742
## 6  54   312831514        9 47.40732
##                                           name objectid   areawater   intptlon
## 1                                  Puerto Rico       50   628200285  -66.41080
## 2 Commonwealth of the Northern Mariana Islands       36   349301029  145.60102
## 3                                     Arkansas       44 -1334552525  -92.44463
## 4                                West Virginia        1   489848791  -80.61833
## 5                                 Rhode Island        6  1323457457  -71.52727
## 6                                   Washington       20  -324557627 -120.57580
##           oid funcstat    centlon State state  statens  centlat
## 1   303146031        A  -66.41476    PR    72 01779808 18.21647
## 2 -1625647860        A  145.59687    MP    69 01779809 16.79744
## 3   266078934        A  -92.44270    AR    05 00068085 34.89402
## 4 -1929409300        A  -80.62346    WV    54 01779805 38.64119
## 5 -1861167639        A  -71.52472    RI    44 01219835 41.59403
## 6 -1859906639        A -120.59557    WA    53 01779804 47.41490
##                                       basename mtfcc region lsadc geoid
## 1                                  Puerto Rico G4000      9    00    72
## 2 Commonwealth of the Northern Mariana Islands G4000      9    00    69
## 3                                     Arkansas G4000      3    00    05
## 4                                West Virginia G4000      3    00    54
## 5                                 Rhode Island G4000      1    00    44
## 6                                   Washington G4000      4    00    53
##                         geometry
## 1 MULTIPOLYGON (((-67.20794 1...
## 2 MULTIPOLYGON (((145.5726 15...
## 3 MULTIPOLYGON (((-94.55218 3...
## 4 MULTIPOLYGON (((-81.74725 3...
## 5 MULTIPOLYGON (((-71.7897 41...
## 6 MULTIPOLYGON (((-123.2479 4...
pl0 <- filter(data.by.state, pain.level == 0)
to_map_0 <- inner_join(usa, pl0, by = "State")

library(mapview)
library(leaflet)
pal_fun <- colorNumeric("BuPu", NULL)
pu_message <- paste0(to_map_0$State,  # paste0 to append tract name with other relevant text 
                     "<br>Number of animals used: ",       # <br> forces new line
                     # use round function to round continuous poverty rate to one decimal point
                     prettyNum(to_map_0$Total, big.mark = ","))

leaflet(to_map_0) %>%
  addPolygons(stroke = FALSE,                        # remove polygon borders
              fillColor = ~pal_fun(Total),
              fillOpacity = 0.8, smoothFactor = 0.5, # increase opacity and resolution
              popup = pu_message) %>%
  addProviderTiles(providers$CartoDB.Positron) %>%   # add third party provider tile
  #addProviderTiles(providers$Stamen.Toner) %>%
  #addProviderTiles(providers$Esri.NatGeoWorldMap)
  addLegend("bottomright",                           # location of legend
            pal=pal_fun,                             # palette function
            values=~Total,                 # variable to be passed to palette function
            title = 'Total number of animals used at pain level 0',                  # legend title
            opacity = 1) %>%                         # legend opacity (1 = completely opaque)
  addScaleBar()

Repeat for pain levels 1, 2, 3. How do I make all maps appear in one grid? Different colors for pain levels or same? How to make a loop/function to repeat for all pain levels?

Now, since we’re in Pennsylvania, let us see what the plots for PA looks like.

#Total for PA
data.PA <- filter(datafr, State == "PA")
ggplot(data = summarise(group_by(data.PA, Year), Total = sum(Total)), aes(x = Year, y = Total)) + geom_line() + scale_x_continuous(breaks = c(2008:2019)) + ggtitle("Total number of animals vs. Year in Pennsylvania")

#Total by column for PA
ggplot(data = data.PA, aes(x = Year, y = Total, color = pain.level)) + geom_line() +
  scale_x_continuous(breaks = c(2008:2019)) + ggtitle("Total number of animals vs. Year by Pain Levels in Pennsylvania")

We see a huge decline from 2009 to 2015.

Next, I want to find out which states contributed the most to the decline observed in the first plot. I calculated the mean and standard deviation over the years for each state and sorted them in decreasing order of SD. Noticing that Arizona seems to have a higher SD than mean, I calculated the coefficient of variability and sorted based on this.

library(dplyr)
#Std dev of total for all states
data.by.state.year <- datafr %>%
  filter(State != "REPORT TOTAL") %>%
  group_by(State, Year) %>%
  summarise(Total = sum(Total))
  
state.std.dev <- summarise(data.by.state.year, mean = mean(Total), sd = sd(Total))
state.std.dev[order(state.std.dev$sd, decreasing = TRUE),]
## # A tibble: 52 x 3
##    State    mean     sd
##    <chr>   <dbl>  <dbl>
##  1 AZ     16261. 35984.
##  2 NY     45696. 25676.
##  3 CA    103133. 22675.
##  4 KS     11863. 20574.
##  5 PA     50591. 20247.
##  6 OH     64393. 18341.
##  7 IA     30172  15683.
##  8 NJ     67210. 13946.
##  9 MO     36277. 13905.
## 10 MD     58124. 13290.
## # ... with 42 more rows
state.std.dev <- state.std.dev %>% mutate(cv = sd/mean) %>% 
                  arrange(desc(cv))
state.std.dev
## # A tibble: 52 x 4
##    State   mean     sd    cv
##    <chr>  <dbl>  <dbl> <dbl>
##  1 AZ    16261. 35984. 2.21 
##  2 KS    11863. 20574. 1.73 
##  3 WV     2116.  2936. 1.39 
##  4 NE     4406.  3886. 0.882
##  5 AK      636.   468. 0.735
##  6 HI      337    227. 0.675
##  7 DC     7995.  5279. 0.660
##  8 WY      543.   307. 0.565
##  9 NY    45696. 25676. 0.562
## 10 IN    17136.  9132. 0.533
## # ... with 42 more rows

The states having CV > 1 will be examined further.

#Total by column for AZ

data.AZ <- filter(datafr, State == "AZ")
ggplot(data = data.AZ, aes(x = Year, y = Total, color = Column)) + geom_line()+
  scale_x_continuous(breaks = c(2008:2019)) + ggtitle("Total number of animals vs. Year by Pain Levels in Arizona")

Column C of the year 2010 is clearly an outlier. I verified the data again from the USDA report. It is most likely a typographical error. On examination, I found that the outlier is coming from the All_Other_Covered_Species. I first replaced that value with NA and found the mean along the column for the rest of the years. I replaced the NA with the rounded mean and finally updated the total. The updated plot for AZ is also shown.

#Setting the outlier point to NA
data.AZ[, 2][data.AZ['Column'] == "C" & data.AZ['Year'] == 2010] <- NA

#Setting the outlier point to the mean of the column over the years excluding outlier year
data.AZ[, 2][data.AZ['Column'] == "C" & data.AZ['Year'] == 2010] <- round(mean(data.AZ[, 2][data.AZ['Column'] == "C"], na.rm = TRUE), digits = 0)

#Updating the Total column for the year
data.AZ[, 'Total'][data.AZ['Column'] == "C" & data.AZ['Year'] == 2010] <- sum(data.AZ[2:12][data.AZ['Column'] == "C" & data.AZ['Year'] == 2010])

#Updated plot with outlier adjusted
ggplot(data = data.AZ, aes(x = Year, y = Total, color = Column)) + geom_line()+
  scale_x_continuous(breaks = c(2008:2019)) + ggtitle("Total number of animals vs. Year by Pain Levels in Arizona (Outlier adjusted)")

The second state having CV > 1 is Kansas. Let’s look at its plot.

#Total by column for KS

data.KS <- filter(datafr, State == "KS")
ggplot(data = data.KS, aes(x = Year, y = Total, color = Column)) + geom_line()+
  scale_x_continuous(breaks = c(2008:2019)) + ggtitle("Total number of animals vs. Year by Pain Levels in Kansas")

2019 column C is likely to be an outlier but unlike the case with Arizone, we cannot be sure since we don’t have the data for the years after 2019.

Finally, we have West Virginia.

#Total by column for WV

data.WV <- filter(datafr, State == "WV")
ggplot(data = data.WV, aes(x = Year, y = Total, color = Column)) + geom_line()+
  scale_x_continuous(breaks = c(2008:2019)) + ggtitle("Total number of animals vs. Year by Pain Levels in West Virginia")

WV does not seem to have outliers, since there is an increase and decrease in all columns over 4 years from 2015-2019.

Now that we have investigated outliers, let us go back to the Total number of animals vs. Year plots and revise them.

datafr[datafr['State'] == "AZ" & datafr['Column'] == "C" & datafr['Year'] == 2010, ] <- data.AZ[data.AZ['Column'] == "C" & data.AZ['Year'] == 2010, ]
datafr[530, 2:13] <- colSums(datafr[478:529, 2:13])

data.total <- filter(datafr, State == "REPORT TOTAL")

#Total by year
data.total1 <- summarise(group_by(data.total, Year), Total = sum(Total))
ggplot(data = data.total1, aes(x = Year, y = Total)) + geom_line() + scale_x_continuous(breaks = c(2008:2019)) + scale_y_continuous(breaks = seq(900000, 1300000, by = 100000), labels = seq(9, 13, 1)) + xlab(label = "Year") + ylab(label = "Total in hundred thousands") + ggtitle("Total number of animals vs. Year") +
theme(
        plot.title = element_text(size = 10, face = "bold", hjust = 0.5),
        axis.title = element_text(size = 10),
        axis.text = element_text(size = 10),
        legend.position = "bottom",
        legend.direction = "horizontal",
        legend.title = element_text(size = 10),
        legend.text = element_text(size = 10)
      )

#Total by year and column
total.plot <- ggplot(data = data.total, aes(x = Year, y = Total, color = pain.level)) + geom_line() + scale_x_continuous(breaks = c(2008:2019)) + scale_y_continuous(breaks = seq(100000, 700000, by = 100000), labels = seq(1, 7, 1)) + xlab(label = "Year") + ylab(label = "Total in hundred thousands") + ggtitle("Total number of animals vs. Year by Pain Levels") +
theme(
        plot.title = element_text(size = 10, face = "bold", hjust = 0.5),
        axis.title = element_text(size = 10),
        axis.text = element_text(size = 10),
        legend.position = "bottom",
        legend.direction = "horizontal",
        legend.title = element_text(size = 10),
        legend.text = element_text(size = 10)
      )
total.plot

datafr[datafr['State'] == "REPORT TOTAL" & (datafr['Year'] == 2015 | datafr['Year'] == 2016), ]
##             State All_Other_Covered_Species  Cats  Dogs Guinea_Pigs Hamsters
## 1537 REPORT TOTAL                     36221  1149  6080       14033     8309
## 1590 REPORT TOTAL                     86618 12586 40071      120519    45863
## 1643 REPORT TOTAL                     39652  7288 20668       29927    20000
## 1696 REPORT TOTAL                      3796    58   362       22418    32557
## 1749 REPORT TOTAL                     44318  1276  5781       19953     4210
## 1802 REPORT TOTAL                    158863 12784 41835      126355    50178
## 1855 REPORT TOTAL                     51237  6171 18369       30440    17568
## 1908 REPORT TOTAL                      3434    20   699       24462    33279
##      Marine.Mammals Nonhuman_Primates Other_Farm_Animals   Pig Rabbits Sheep
## 1537             NA             43634               4419  5654   15662  1364
## 1590             NA             32962              16037 10194   77838  3284
## 1643             NA             27972              10416 35422   53998  7275
## 1696             NA              1016               1333   861    6512   119
## 1749             NA             38632                 NA  5833   16124  1105
## 1802             NA             42650                 NA 10553   82997  4288
## 1855             NA             27079                 NA 35516   51985  7838
## 1908             NA              1272                 NA  2168    5743    93
##       Total Year Column pain.level
## 1537 136525 2015      B          0
## 1590 445972 2015      C          1
## 1643 252618 2015      D          2
## 1696  69032 2015      E          3
## 1749 137232 2016      B          0
## 1802 530503 2016      C          1
## 1855 246203 2016      D          2
## 1908  71170 2016      E          3
data.total1 <- mutate(data.total1, percentage.change = (Total/lag(Total) -1)*100) 
data.total2 <- mutate(data.total1, percentage.change = 100*(Total - first(Total))/Total) #This gives the same as first plot with different y axis
#This tells us that the overall change in 12 years is a 24% reduction.

ggplot(data = data.total1, aes(x = Year, y = percentage.change)) + geom_line(na.rm = TRUE) + scale_x_continuous(breaks = c(2008:2019))

Now we observe a more steady decline in the total number of animals over the years.

#Nested functions for 50 plots

#This chunk not needed?
state_plot <- purrr::map(
   .x = unique(datafr$State)[unique(datafr$State) != "REPORT TOTAL"], 
  .f = function(x){
    ggplot(data = datafr %>% filter(State == x), aes(x = Year, y = Total, color = pain.level)) + geom_line() + ggtitle(label = x) +
  scale_x_continuous(breaks = c(2008:2019)) +
      theme(
        plot.title = element_text(size = 20, face = "bold", hjust = 0.5),
        axis.title = element_blank(),
        axis.text = element_text(size = 8),
        legend.position = "none"
      )
  }
)

ggplot(data = datafr %>% filter(State != "REPORT TOTAL"), aes(x = Year, y = Total, color = pain.level)) + geom_line(size = 2) + ggtitle("Total number of animals used by each state by pain levels") +
  scale_x_continuous(breaks = c(2008:2019)) + facet_wrap( ~ State, ncol = 5, scales = "free") + theme(
        strip.text.x = element_text(size = 28, color = "blue", face = "bold"),
        plot.title = element_text(size = 50, face = "bold", hjust = 0.5),
        axis.title = element_blank(),
        axis.text = element_text(size = 8),
        legend.position = "bottom",
        legend.justification = "right",
        legend.title = element_text(size = 50),
        legend.text = element_text(size = 48)
      )

#Bigger legend
#Line patterns - don't bother about y axis, quantity doesn't matter

Doing the same standard deviation analysis for species. Do bar plots for pain levels.

library("ggplot2")
#Std dev of species for top states
data.CA <- filter(datafr, State == "CA") %>%
  filter(State != "REPORT TOTAL") %>%
  select(-State) %>%
  group_by(Year) %>%
  summarise(across(.cols = All_Other_Covered_Species:Total, .fns = sum, na.rm = TRUE)) %>%
  ungroup() %>%
#colMeans(data.CA, na.rm = TRUE)
#full join
  summarise(across(.cols = All_Other_Covered_Species:Total, .fns = sd, na.rm = TRUE)) %>%
  pivot_longer(cols = All_Other_Covered_Species:Total, names_to = "species", values_to = "sd") %>%
  arrange(desc(sd)) %>% full_join(

filter(datafr, State == "CA") %>%
  filter(State != "REPORT TOTAL") %>%
  select(-State) %>%
  group_by(Year) %>%
  summarise(across(.cols = All_Other_Covered_Species:Total, .fns = sum, na.rm = TRUE)) %>%
  ungroup() %>%
  summarise(across(.cols = All_Other_Covered_Species:Total, .fns = ~mean(x = .x, na.rm = TRUE))) %>%
  pivot_longer(cols = All_Other_Covered_Species:Total, names_to = "species", values_to = "mean"), by = "species")

ggplot(data.CA, aes(x = species, y = sd)) + geom_bar(stat="identity", fill="steelblue", width=0.5) + geom_bar(mapping = aes(y = mean), stat="identity", fill="steelblue", width=0.5, alpha = 0.5, position = "dodge") + ggtitle("Standard deviation of species over 12 years in CA") +
theme(
        plot.title = element_text(size = 10, face = "bold", hjust = 0.5),
        axis.title = element_text(size = 10),
        axis.text.x = element_text(size = 8, angle = 90, vjust = 0.5, hjust = 0.5),
        legend.position = "bottom",
        legend.direction = "horizontal",
        legend.title = element_text(size = 10),
        legend.text = element_text(size = 10)
      )

data.CA.1 <- filter(datafr, State == "CA") %>%
  filter(State != "REPORT TOTAL") %>%
  select(-State) %>%
  group_by(Year, pain.level) %>%
  summarise(across(.cols = All_Other_Covered_Species:Total, .fns = sum, na.rm = TRUE)) %>%
  ungroup() %>%
  group_by(pain.level) %>%
  summarise(across(.cols = All_Other_Covered_Species:Total, .fns = sd, na.rm = TRUE)) %>%
  pivot_longer(cols = All_Other_Covered_Species:Total, names_to = "species", values_to = "sd")

  ggplot(data.CA.1, aes(x = pain.level, y = sd, fill = species)) + geom_bar(stat = "identity", width = 0.5) + ggtitle("Standard deviation of species over 12 years in CA by pain level") +
theme(
        plot.title = element_text(size = 10, face = "bold", hjust = 0.5),
        axis.title = element_text(size = 10),
        axis.text.x = element_text(size = 8, angle = 90, vjust = 0.5, hjust = 0.5),
        legend.position = "bottom",
        legend.direction = "horizontal",
        legend.title = element_text(size = 10),
        legend.text = element_text(size = 10)
      )

library("ggplot2")
#Std dev of species for top states
data.MA <- filter(datafr, State == "MA") %>%
  filter(State != "REPORT TOTAL") %>%
  select(-State) %>%
  group_by(Year) %>%
  summarise(across(.cols = All_Other_Covered_Species:Total, .fns = sum, na.rm = TRUE)) %>%
  ungroup() %>%
  summarise(across(.cols = All_Other_Covered_Species:Total, .fns = sd, na.rm = TRUE)) %>%
  pivot_longer(cols = All_Other_Covered_Species:Total, names_to = "species", values_to = "sd") %>%
  arrange(desc(sd))

ggplot(data.MA, aes(x = species, y = sd)) + geom_bar(stat="identity", fill="steelblue", width=0.5) + ggtitle("Standard deviation of species over 12 years in MA") +
theme(
        plot.title = element_text(size = 10, face = "bold", hjust = 0.5),
        axis.title = element_text(size = 10),
        axis.text.x = element_text(size = 8, angle = 90, vjust = 0.5, hjust = 0.5),
        legend.position = "bottom",
        legend.direction = "horizontal",
        legend.title = element_text(size = 10),
        legend.text = element_text(size = 10)
      )

data.MA.1 <- filter(datafr, State == "MA") %>%
  filter(State != "REPORT TOTAL") %>%
  select(-State) %>%
  group_by(Year, pain.level) %>%
  summarise(across(.cols = All_Other_Covered_Species:Total, .fns = sum, na.rm = TRUE)) %>%
  ungroup() %>%
  group_by(pain.level) %>%
  summarise(across(.cols = All_Other_Covered_Species:Total, .fns = sd, na.rm = TRUE)) %>%
  pivot_longer(cols = All_Other_Covered_Species:Total, names_to = "species", values_to = "sd")

  ggplot(data.MA.1, aes(x = pain.level, y = sd, fill = species)) + geom_bar(stat = "identity", position = "dodge", width = 0.5) + ggtitle("Standard deviation of species over 12 years in MA by pain level") +
theme(
        plot.title = element_text(size = 10, face = "bold", hjust = 0.5),
        axis.title = element_text(size = 10),
        axis.text.x = element_text(size = 8, angle = 90, vjust = 0.5, hjust = 0.5),
        legend.position = "bottom",
        legend.direction = "horizontal",
        legend.title = element_text(size = 10),
        legend.text = element_text(size = 10)
      )

library("ggplot2")
#Std dev of species for top states
data.NJ <- filter(datafr, State == "NJ") %>%
  filter(State != "REPORT TOTAL") %>%
  select(-State) %>%
  group_by(Year) %>%
  summarise(across(.cols = All_Other_Covered_Species:Total, .fns = sum, na.rm = TRUE)) %>%
  ungroup() %>%
#colMeans(data.NJ, na.rm = TRUE)
  #mean needed?
  summarise(across(.cols = All_Other_Covered_Species:Total, .fns = sd, na.rm = TRUE)) %>%
  pivot_longer(cols = All_Other_Covered_Species:Total, names_to = "species", values_to = "sd") %>%
  arrange(desc(sd))

ggplot(data.NJ, aes(x = species, y = sd)) + geom_bar(stat="identity", fill="steelblue", width=0.5) + ggtitle("Standard deviation of species over 12 years in NJ") +
theme(
        plot.title = element_text(size = 10, face = "bold", hjust = 0.5),
        axis.title = element_text(size = 10),
        axis.text.x = element_text(size = 8, angle = 90, vjust = 0.5, hjust = 0.5),
        legend.position = "bottom",
        legend.direction = "horizontal",
        legend.title = element_text(size = 10),
        legend.text = element_text(size = 10)
      )

data.NJ.1 <- filter(datafr, State == "NJ") %>%
  filter(State != "REPORT TOTAL") %>%
  select(-State) %>%
  group_by(Year, pain.level) %>%
  summarise(across(.cols = All_Other_Covered_Species:Total, .fns = sum, na.rm = TRUE)) %>%
  ungroup() %>%
  group_by(pain.level) %>%
  summarise(across(.cols = All_Other_Covered_Species:Total, .fns = sd, na.rm = TRUE)) %>%
  pivot_longer(cols = All_Other_Covered_Species:Total, names_to = "species", values_to = "sd")

  ggplot(data.NJ.1, aes(x = pain.level, y = sd, fill = species)) + geom_bar(stat = "identity", position = "dodge", width = 0.5) + ggtitle("Standard deviation of species over 12 years in NJ by pain level") +
theme(
        plot.title = element_text(size = 10, face = "bold", hjust = 0.5),
        axis.title = element_text(size = 10),
        axis.text.x = element_text(size = 8, angle = 90, vjust = 0.5, hjust = 0.5),
        legend.position = "bottom",
        legend.direction = "horizontal",
        legend.title = element_text(size = 10),
        legend.text = element_text(size = 10)
      )

m1 <- lm(Total ~ Year, data = data.total1)
summary(m1)
## 
## Call:
## lm(formula = Total ~ Year, data = data.total1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -85230 -18560  -5143  25807  52073 
## 
## Coefficients:
##             Estimate Std. Error t value  Pr(>|t|)    
## (Intercept) 54799007    6887537   7.956 0.0000124 ***
## Year          -26705       3421  -7.807 0.0000146 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 40910 on 10 degrees of freedom
## Multiple R-squared:  0.859,  Adjusted R-squared:  0.845 
## F-statistic: 60.95 on 1 and 10 DF,  p-value: 0.00001458
predict(m1, newdata = data.total1)
##         1         2         3         4         5         6         7         8 
## 1176308.3 1149603.7 1122899.2 1096194.7 1069490.1 1042785.6 1016081.1  989376.5 
##         9        10        11        12 
##  962672.0  935967.5  909262.9  882558.4
data.total.1.predict <- data.frame(year = 2019:2024, predict(m1, newdata = tibble(Year = 2019:2024), interval = "predict"))

ggplot(data = data.total1, aes(x = Year, y = Total)) + geom_line() + geom_smooth(formula = "y~x", method = "lm") + geom_line(data = data.total.1.predict, mapping = aes(x = year, y = fit), linetype = "dashed", color = "blue") + scale_x_continuous(breaks = c(2008:2024)) + scale_y_continuous(breaks = seq(700000, 1300000, by = 100000), labels = seq(7, 13, 1)) + xlab(label = "Year") + ylab(label = "Total in hundred thousands") + ggtitle("Total number of animals vs. Year") +
theme(
        plot.title = element_text(size = 10, face = "bold", hjust = 0.5),
        axis.title = element_text(size = 10),
        axis.text = element_text(size = 10),
        legend.position = "bottom",
        legend.direction = "horizontal",
        legend.title = element_text(size = 10),
        legend.text = element_text(size = 10)
      )

m.quad <- lm(Total ~ Year+Year.sq, data = mutate(data.total1, Year.sq = Year*Year))
summary(m.quad)
## 
## Call:
## lm(formula = Total ~ Year + Year.sq, data = mutate(data.total1, 
##     Year.sq = Year * Year))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -80005 -24414  -3277  28059  54767 
## 
## Coefficients:
##                 Estimate   Std. Error t value Pr(>|t|)
## (Intercept) 2245798614.7 4728853108.7   0.475    0.646
## Year          -2203020.4    4697157.1  -0.469    0.650
## Year.sq            540.4       1166.4   0.463    0.654
## 
## Residual standard error: 42610 on 9 degrees of freedom
## Multiple R-squared:  0.8623, Adjusted R-squared:  0.8317 
## F-statistic: 28.19 on 2 and 9 DF,  p-value: 0.0001333
#make another column to scale year 0-12?